Visual Explanations from Hadamard Product in Multimodal Deep Networks

Authors

  • Jin-Hwa Kim
  • Byoung-Tak Zhang
Abstract

Visual explanation of a model's learned representations helps in understanding the fundamentals of learning. Previous attentional models visualize the attended regions over an image or text using their learned weights to confirm the intended mechanism. Kim et al. (2016) show that the Hadamard product in multimodal deep networks, well known as a joint function for visual question answering tasks, implicitly performs an attentional mechanism over visual inputs. In this work, we extend their result to show, using the proposed gradient-based visualization technique, that the Hadamard product in multimodal deep networks performs attention not only over visual inputs but also over textual inputs simultaneously. The attentional effect of the Hadamard product is visualized for both visual and textual inputs by analyzing the two inputs and the output of the Hadamard product with the proposed method, and is compared with the learned attentional weights of a visual question answering model.
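
As a concrete illustration of the gradient-based visualization described above, the following minimal PyTorch sketch evaluates a Hadamard-product joint layer and reads the gradients of its output with respect to the two inputs as attention-like relevance maps. The MLB-style projection-and-tanh fusion, the tensor shapes, and the scalar summary used for backpropagation are illustrative assumptions, not the paper's exact procedure.

```python
import torch

torch.manual_seed(0)

d_v, d_q, d_h = 2048, 1200, 1200  # hypothetical feature/hidden sizes
n_regions = 14 * 14               # hypothetical number of image regions

# Random projections stand in for a trained VQA model's parameters.
W_v = 0.01 * torch.randn(d_v, d_h)
W_q = 0.01 * torch.randn(d_q, d_h)

v = torch.randn(n_regions, d_v, requires_grad=True)  # visual features per region
q = torch.randn(d_q, requires_grad=True)             # question embedding

# Hadamard-product joint representation: f = tanh(v W_v) * tanh(q W_q).
f = torch.tanh(v @ W_v) * torch.tanh(q @ W_q)        # (n_regions, d_h)

# Backpropagate a scalar summary of the joint output to both inputs.
f.sum().backward()

# Gradient magnitudes act as attention-like relevance maps over
# visual regions and question dimensions, respectively.
visual_relevance = v.grad.norm(dim=1)   # (n_regions,)
textual_relevance = q.grad.abs()        # (d_q,)
print(visual_relevance.shape, textual_relevance.shape)
```

In a trained model one would use the learned weights and a selected answer logit instead of a raw sum, but the mechanics of reading attention off the gradients at the Hadamard product are the same.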

Similar Articles

Generalized Hadamard-Product Fusion Operators for Visual Question Answering

We propose a generalized class of multimodal fusion operators for the task of visual question answering (VQA). We identify generalizations of existing multimodal fusion operators based on the Hadamard product, and show that specific nontrivial instantiations of this generalized fusion operator exhibit superior performance in terms of OpenEnded accuracy on the VQA task. In particular, we introdu...
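
As a rough illustration of the family this abstract refers to, the sketch below implements one simple Hadamard-product fusion operator (a low-rank bilinear form in the spirit of MLB). The class name, layer sizes, and tanh nonlinearity are assumptions for illustration, not the specific generalized instantiation the paper proposes.

```python
import torch
import torch.nn as nn

class HadamardFusion(nn.Module):
    """One member of the Hadamard-product fusion family: project each
    modality, apply a nonlinearity, combine with an element-wise product."""

    def __init__(self, d_v: int, d_q: int, d_h: int, d_out: int):
        super().__init__()
        self.proj_v = nn.Linear(d_v, d_h)
        self.proj_q = nn.Linear(d_q, d_h)
        self.out = nn.Linear(d_h, d_out)

    def forward(self, v: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        joint = torch.tanh(self.proj_v(v)) * torch.tanh(self.proj_q(q))
        return self.out(joint)

fusion = HadamardFusion(d_v=2048, d_q=1200, d_h=1200, d_out=1000)
logits = fusion(torch.randn(8, 2048), torch.randn(8, 1200))
print(logits.shape)  # torch.Size([8, 1000])
```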

A Critical Visual Analysis of Gender Representation of ELT Materials from a Multimodal Perspective

This content analysis study, employing a multimodal perspective and critical visual analysis, set out to analyze gender representations in the Top Notch series, one of the most widely used ELT textbook series in Iran. For this purpose, six images were selected from the series and analyzed in terms of 'representational', 'interactive' and 'compositional' modes of meaning. The results indicated that there are...

Attentive Explanations: Justifying Decisions and Pointing to the Evidence (Extended Abstract)

Deep models are the de facto standard in visual decision problems due to their impressive performance on a wide array of visual tasks. On the other hand, their opaqueness has led to a surge of interest in explainable systems. In this work, we emphasize the importance of model explanation in various forms such as visual pointing and textual justification. The lack of data with justification annot...

Multimodal Residual Learning for Visual QA

Deep neural networks continue to advance the state of the art of image recognition tasks with various methods. However, applications of these methods to multimodality remain limited. We present Multimodal Residual Networks (MRN) for the multimodal residual learning of visual question-answering, which extends the idea of deep residual learning. Unlike deep residual learning, MRN effectiv...
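
To make the idea concrete, here is a minimal sketch of one multimodal residual block in the spirit of MRN: the question representation is updated through a shortcut connection plus a joint residual function built on the Hadamard product. The dimensions, activations, and exact block structure are illustrative assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class MRNBlock(nn.Module):
    """Sketch of a multimodal residual block: identity shortcut on the
    question plus a Hadamard-product joint residual with the visual input."""

    def __init__(self, d: int, d_v: int):
        super().__init__()
        self.fq = nn.Linear(d, d)
        self.fv = nn.Sequential(nn.Linear(d_v, d), nn.Tanh(), nn.Linear(d, d))

    def forward(self, q: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # Joint residual: element-wise product of modality-specific mappings.
        residual = torch.tanh(self.fq(q)) * torch.tanh(self.fv(v))
        return q + residual  # shortcut connection, as in deep residual learning

block = MRNBlock(d=1200, d_v=2048)
q_next = block(torch.randn(8, 1200), torch.randn(8, 2048))
print(q_next.shape)  # torch.Size([8, 1200])
```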

Auxiliary Multimodal LSTM for Audio-visual Speech Recognition and Lipreading

Audio-visual Speech Recognition (AVSR), which employs both video and audio information to perform Automatic Speech Recognition (ASR), is one of the applications of multimodal learning, making ASR systems more robust and accurate. Traditional models usually treated AVSR as inference or projection, but strict priors limited their ability. With the revival of deep learning, Deep Neural Networks (DNN) be...


Journal:
  • CoRR

Volume: abs/1712.06228

Issue: -

Pages: -

Publication date: 2017